NLP on Medical Abstracts

Multi-Label Classification and Extractive Summarization of Medical Abstracts.

This project was developed in collaboration with Paolo Caggiano for the Text Mining course in the Master's Degree in Data Science.

The growing volume of biomedical literature makes it increasingly hard to retrieve relevant information quickly. Our project addresses this by combining traditional Natural Language Processing (NLP) techniques to classify and summarize medical abstracts.

The first part of the project focuses on multi-class, multi-label classification of medical documents across five diagnostic categories. We explore and compare multiple combinations of text preprocessing (basic cleaning, stop-word removal, lemmatization), feature extraction (bag of words, term frequency, TF-IDF, word embeddings), and feature selection methods (rare word removal, PCA). We then apply four classifiers (Naive Bayes, Decision Trees, Random Forests, and SVMs) and evaluate their performance through F1 score with 5-fold cross-validation. The best result (71.1% F1 score) was obtained using pretrained biomedical word embeddings and SVMs.
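To make two of the building blocks above concrete, here is a minimal standard-library sketch of TF-IDF weighting and of an F1 score for multi-label predictions. It is an illustration only, not the project's code: the function names (tokenize, tf_idf, micro_f1), the whitespace tokenizer, the unsmoothed log(N / df) formula, and the choice of micro-averaging are all assumptions, since the description does not state which variants were used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(N / df), with df the number of documents containing the term
    (assumed variant; libraries often add smoothing terms)
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term at most once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data; each item is a set of labels."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # labels correctly predicted
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted but not true
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # true but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy documents and labels, invented for illustration.
docs = ["tumor growth in lung", "lung infection", "heart failure"]
vectors = tf_idf(docs)
score = micro_f1([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}])
```

In the toy corpus, "lung" occurs in two of the three documents and therefore receives a lower weight than "tumor", which occurs in only one. With cross-validation, a score like micro_f1 would be computed on each held-out fold and then averaged.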

In the second part, we apply two extractive summarization techniques, graph-based PageRank and Latent Semantic Analysis (LSA), to the abstracts. We evaluate the summaries using ROUGE scores against article titles and by assessing how well the classification results can be reproduced from the summaries alone. Our graph-based summarizer outperformed both LSA and random baselines, retaining more relevant information for downstream tasks.
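The graph-based approach can be sketched as follows: sentences become nodes, edges are weighted by sentence similarity, and PageRank scores decide which sentences to keep. This is a hedged illustration rather than the project's implementation; the Jaccard word-overlap similarity, the naive sentence splitter, the damping factor of 0.85, and all function names are assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Jaccard overlap between the word sets of two sentences (assumed metric)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted graph given as a square matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(text, n_sentences=1):
    """Return the top-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


# Toy abstract, invented for illustration.
abstract = (
    "Aspirin reduces heart attack risk. "
    "Aspirin reduces stroke risk in heart patients. "
    "Patients reported mild nausea."
)
summary = summarize(abstract, n_sentences=1)
```

On the toy abstract, the second sentence shares words with both of the others, so it ends up with the highest score and is selected as the one-sentence summary.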

The project showcases our ability to apply a full NLP pipeline, from document representation to classification and summarization, emphasizing model evaluation and empirical comparison of classical techniques. Future work is directed toward integrating contextualized embeddings like Med-BERT and exploring abstractive summarization.

The complete code and methodology are detailed on the GitHub page and in the downloadable report below.

Tags

NLP, Text Mining, Classification, Summarization, Word Embeddings, Feature Selection